A note on the Bayesian regret of Thompson Sampling with an arbitrary prior
Authors
Sébastien Bubeck and Che-Yu Liu
Abstract
We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We show that, for any prior distribution, the Thompson Sampling strategy achieves a Bayesian regret bounded from above by $14\sqrt{nK}$. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by $\frac{1}{20}\sqrt{nK}$.

In this paper we are interested in the Bayesian multi-armed bandit problem, which can be described as follows. Let $\pi_0$ be a known distribution over some set $\Theta$, and let $\theta$ be a random variable distributed according to $\pi_0$. For $i \in [K]$, let $(X_{i,s})_{s \ge 1}$ be identically distributed random variables taking values in $[0,1]$ which are independent conditionally on $\theta$. Denote $\mu_i(\theta) := \mathbb{E}(X_{i,1} \mid \theta)$. Consider now an agent facing $K$ actions (or arms). At each time step $t = 1, \ldots, n$, the agent pulls an arm $I_t \in [K]$. The agent receives the reward $X_{i,s}$ when it pulls arm $i$ for the $s$-th time. The arm selection is based only on past observed rewards and potentially on an external source of randomness. More formally, let $(U_s)_{s \ge 1}$ be an i.i.d. sequence of random variables uniformly distributed on $[0,1]$, and let $T_i(s) = \sum_{t=1}^{s} \mathbb{1}_{\{I_t = i\}}$; then $I_t$ is a random variable measurable with respect to $\sigma(I_1, X_{I_1,1}, \ldots, I_{t-1}, X_{I_{t-1}, T_{I_{t-1}}(t-1)}, U_t)$. We measure the performance of the agent through the Bayesian regret, defined as
$$\mathrm{BR}_n = \mathbb{E} \sum_{t=1}^{n} \Big( \max_{i \in [K]} \mu_i(\theta) - \mu_{I_t}(\theta) \Big),$$
where the expectation is taken over $\theta$, the rewards, and the external randomization.
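To make the setting concrete, below is a minimal simulation sketch (not taken from the paper) of Thompson Sampling on a Bernoulli bandit with an independent uniform, i.e. Beta(1,1), prior on each arm's mean, estimating the Bayesian regret by averaging over draws of $\theta$ from the prior. The function names `thompson_sampling` and `estimate_bayesian_regret` and the Beta-Bernoulli model are illustrative assumptions; the paper's bound holds for an arbitrary prior.

```python
import numpy as np


def thompson_sampling(means, n, rng):
    """Thompson Sampling with a Beta(1,1) prior on each Bernoulli arm.

    Returns the realized pseudo-regret sum_t (max_i mu_i - mu_{I_t})
    for one draw of theta (the vector of arm means).
    """
    K = len(means)
    successes = np.zeros(K)  # posterior is Beta(successes + 1, failures + 1)
    failures = np.zeros(K)
    best = np.max(means)
    regret = 0.0
    for _ in range(n):
        # Sample a mean for each arm from its posterior and pull the argmax.
        samples = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(samples))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        regret += best - means[arm]
    return regret


def estimate_bayesian_regret(n=1000, K=5, trials=200, seed=0):
    """Estimate the Bayesian regret by averaging over theta ~ prior."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        means = rng.uniform(0.0, 1.0, size=K)  # theta drawn from the prior
        total += thompson_sampling(means, n, rng)
    return total / trials


if __name__ == "__main__":
    n, K = 1000, 5
    print(f"estimated Bayesian regret: {estimate_bayesian_regret(n, K):.1f}")
    print(f"14 * sqrt(n K) upper bound: {14 * np.sqrt(n * K):.1f}")
```

For $n = 1000$ and $K = 5$ the estimate should sit well below the prior-independent upper bound $14\sqrt{nK} \approx 990$; the bound is worst-case over priors, so this particular Beta-Bernoulli instance does not come close to it.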
Similar papers
A Note on Information-Directed Sampling and Thompson Sampling
This note introduces three Bayesian-style multi-armed bandit algorithms: Information-Directed Sampling, Thompson Sampling, and Generalized Thompson Sampling. The goal is to give an intuitive explanation of these three algorithms and their regret bounds, and to provide some derivations that are omitted in the original papers.
Thompson Sampling for Online Learning with Linear Experts
In this note, we present a version of the Thompson sampling algorithm for the problem of online linear generalization with full information (i.e., the experts setting), studied by Kalai and Vempala, 2005. The algorithm uses a Gaussian prior and time-varying Gaussian likelihoods, and we show that it essentially reduces to Kalai and Vempala's Follow-the-Perturbed-Leader strategy, with exponentiall...
Complex Bandit Problems and Thompson Sampling
We study stochastic multi-armed bandit settings with complex actions derived from the basic bandit arms, e.g., subsets or partitions of basic arms. The decision maker is faced with selecting at each round a complex action instead of a basic arm. We allow the reward of the complex action to be some function of the basic arms’ rewards, and so the feedback observed may not necessarily be the rewar...
Information Directed Sampling and Bandits with Heteroscedastic Noise
In the stochastic bandit problem, the goal is to maximize an unknown function via a sequence of noisy function evaluations. Typically, the observation noise is assumed to be independent of the evaluation point and satisfies a tail bound taken uniformly on the domain. In this work, we consider the setting of heteroscedastic noise, that is, we explicitly allow the noise distribution to depend on ...
Nonparametric General Reinforcement Learning
Reinforcement learning problems are often phrased in terms of Markov decision processes (MDPs). In this thesis we go beyond MDPs and consider reinforcement learning in environments that are non-Markovian, non-ergodic and only partially observable. Our focus is not on practical algorithms, but rather on the fundamental underlying problems: How do we balance exploration and exploitation? How do w...
Journal: CoRR
Volume: abs/1304.5758
Pages: -
Publication date: 2013